Statistical language modeling for speech disfluencies

Authors

  • Andreas Stolcke
  • Elizabeth Shriberg
Abstract

Speech disfluencies (such as filled pauses, repetitions, and restarts) are among the characteristics distinguishing spontaneous speech from planned or read speech. We introduce a language model that predicts disfluencies probabilistically and uses an edited, fluent context to predict the following words. The model is based on a generalization of the standard N-gram language model. It uses dynamic programming to compute the probability of a word sequence, taking into account possible hidden disfluency events. We analyze the model's performance for various disfluency types on the Switchboard corpus. We find that the model reduces word perplexity in the neighborhood of disfluency events; however, overall differences are small and have no significant impact on recognition accuracy. We also note that modeling the most frequent type of disfluency, filled pauses, requires a segmentation of utterances into linguistic (rather than acoustic) units. Our analysis illustrates a generally useful technique for language model evaluation based on local perplexity comparisons.

1. MOTIVATION AND OVERVIEW

Speech disfluencies (DFs) are prevalent in spontaneous speech, and are among the characteristics distinguishing spontaneous speech from planned or read speech. DFs are one of many potential factors contributing to the relatively poor performance of state-of-the-art recognizers on this type of speech, e.g., as found in the Switchboard [2] corpus. Past work on disfluent speech has focused on disfluency detection, using either acoustic features [7, 6] or recognized word sequences [1, 3]. Our goal in this work is to develop a statistical language model (LM) that can be used for speech decoding or rescoring, and that improves upon standard LMs by explicitly modeling the most frequent DF types. The main reason to expect that DF modeling can improve the LM is that standard N-gram models are based on word predictions from local contexts, which are rendered less uniform by intervening DFs. Other researchers have recently started exploring approaches to DF modeling based on similar assumptions [4, 8].

Section 2 describes a simple N-gram-style DF model, based on the intuition that DF events need to be predicted and edited from the context to improve the prediction of following words. Section 3 compares the DF model with a baseline LM, in terms of both perplexities and word error rates on Switchboard data. The emphasis is on a detailed analysis of the model at DF and following-word positions. Section 4 provides a general discussion of the results.

2. THE MODEL

2.1. Disfluency types

Following [9], DFs can be classified based on how the actual utterance must be modified to obtain the intended fluent utterance, i.e., the utterance a speaker would produce if asked to repeat his or her utterance. The types can be characterized by the kind of editing required.

Filled pauses (FP): The pause filler (typically "uh" or "um") must be excised.
    SHE UH GOT REAL LUCKY THOUGH --> SHE GOT REAL LUCKY THOUGH

Repetitions (REP): Contiguous repeated words must be removed.
    IT'S A IT'S A FAIRLY LARGE COMMUNITY --> IT'S A FAIRLY LARGE COMMUNITY

Deletions (DEL): Words without correspondence in the repaired word sequence must be deleted.
    I DID YOU HAPPEN TO SEE ... --> DID YOU HAPPEN TO SEE ...

We know from prior work [9] that these three types of DF are the most frequent across a variety of spontaneous speech corpora, accounting for over 85% of DF tokens in the Switchboard corpus (see note 1). See [9] for a description of other, less frequent, types of DF that are not modeled explicitly in our LM. For example, we are not modeling word substitutions or speech errors.

Note 1: DF frequencies in Switchboard were estimated from a hand-labeled subset of 60 conversation sides, containing 40,500 words. The coverage figure takes into account the further limits on modeled repetitions and utterance-medial deletions described below.
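The editing operations above are purely local deletions from the surface word string. As an informal illustration (not from the paper), the sketch below applies such edits given span annotations; the function name and the annotation format (type, start index, span length) are assumptions of this sketch only.

```python
# Hypothetical illustration of the Section 2.1 edits: given a word sequence and
# disfluency annotations, produce the intended fluent sequence. The annotation
# format (kind, start, length) is an assumption made for this sketch.

def cleanup(words, annotations):
    """Remove annotated disfluent words from `words`.

    annotations: list of (kind, start, length) with kind in {"FP", "REP", "DEL"}.
    For FP the filler word is excised; for REP the extra copy of the repeated
    words is removed; for DEL the extraneous words are removed.
    """
    drop = set()
    for kind, start, length in annotations:
        drop.update(range(start, start + length))
    return [w for i, w in enumerate(words) if i not in drop]

# The paper's examples:
print(cleanup("SHE UH GOT REAL LUCKY THOUGH".split(), [("FP", 1, 1)]))
# -> ['SHE', 'GOT', 'REAL', 'LUCKY', 'THOUGH']
print(cleanup("IT'S A IT'S A FAIRLY LARGE COMMUNITY".split(), [("REP", 0, 2)]))
# -> ["IT'S", 'A', 'FAIRLY', 'LARGE', 'COMMUNITY']
print(cleanup("I DID YOU HAPPEN TO SEE".split(), [("DEL", 0, 1)]))
# -> ['DID', 'YOU', 'HAPPEN', 'TO', 'SEE']
```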
2.2. The Cleanup Model

The central assumption incorporated in our DF language model is that probability estimates for words after a DF are more accurate if conditioned on the intended fluent word sequence. A secondary assumption is that DFs themselves can be modeled as word-like events, each having a probability conditioned on the context. A standard language model, by contrast, would look only at the surface string of words and assign word probabilities in a strictly sequential manner. Because of the central assumption, we call our DF model the "Cleanup Model." It is implemented as a standard backoff trigram model with the following three modifications to account for DFs.

1. Words following a DF event are conditioned on the cleaned-up, fluent version of the context. Filled pauses are removed from contexts, as is the sequence of extraneous words in repetitions and deletions. For example, the probability estimate for "WANT" following "BECAUSE I I" would be

    P(WANT | BECAUSE I REP1) = P(WANT | BECAUSE I),

   where REP1 denotes a repetition event. The repeated "I" is deleted from the context.

2. Disfluencies are represented by probabilistic events occurring within the word stream, some of which are hidden from direct observation. For simplicity, we model only the most prevalent subtypes for each DF class, namely filled pauses UH and UM, repetitions of one or two words (REP1, REP2), deletions at the beginning of a sentence (SDEL), and other one- or two-word deletions (DEL1, DEL2).

3. Just like words, DFs are treated as events that are assigned probabilities conditioned on their context. The contexts themselves are subject to DF cleanup as described above. For example, P(REP1 | BECAUSE I) is the probability of repeating "I" after "BECAUSE."

By representing DFs simply as another type of N-gram event, we allow DFs to be conditioned on specific lexical contexts, so that simple word-based regularities in their distribution can be captured. Furthermore, because of its simple N-gram character, the model does not embody specific assumptions or constraints about the distribution of DF events.

2.3. Probability computation

To account for the hidden DF events potentially occurring between any two words, a forward computation is carried out to find the probability of a sentence prefix P(w_1 w_2 ... w_k). Conditional word probabilities are then computed as

    P(w_{k+1} | w_1 ... w_k) = P(w_1 ... w_{k+1}) / P(w_1 ... w_k).

If the underlying N-gram model is a trigram, it is sufficient to keep eight states for each word position, according to whether the DF prior to w_k was NODF (none), FP (filled pause), SDEL, DEL1, DEL2, REP1, REP2, or the second position after a REP2 event. To illustrate, the partial computation involving just the NODF and REP1 states is shown here:

    P(w_1 ... w_k NODF w_{k+1}) = P(w_1 ... w_{k-1} NODF w_k) p(w_{k+1} | w_{k-1} w_k)
                                + P(w_1 ... w_{k-1} REP1 w_k) p(w_{k+1} | w_{k-2} w_{k-1})

    P(w_1 ... w_k REP1 w_{k+1}) = delta(w_k, w_{k+1}) [ P(w_1 ... w_{k-1} NODF w_k) p(REP1 | w_{k-1} w_k)
                                + P(w_1 ... w_{k-1} REP1 w_k) p(REP1 | w_{k-2} w_{k-1}) ]

where delta(w_i, w_j) = 1 if w_i = w_j, and 0 otherwise. Trigram probabilities are denoted by p(. | .); these are obtained through the usual backoff procedure [5]. The total prefix probability is then computed as

    P(w_1 ... w_k) = sum_X P(w_1 ... X w_k),

where X ranges over the hidden states representing the disfluency types (including NODF).
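To make the forward recursion concrete, here is a minimal sketch restricted to the NODF and REP1 states, mirroring the partial computation above. The function names, the sentence-start padding, and the (u, v) context representation are assumptions of this sketch; the trigram probability p(w | u v) and the REP1 event probability are assumed to come from a backoff model that is not shown, and the remaining six states are omitted.

```python
# Minimal sketch of the Section 2.3 forward computation, restricted to the
# NODF and REP1 hidden states. p_word(w, u, v) stands for the backoff trigram
# p(w | u v) on the cleaned context; p_rep1(u, v) for p(REP1 | u v).

def prefix_forward(words, p_word, p_rep1):
    """Return alpha, where alpha[j][s] = P(w_1 ... w_{j-1} s w_j), s in {NODF, REP1}.

    The prefix probability P(w_1 ... w_j) is sum(alpha[j].values()), and
    P(w_{j+1} | w_1 ... w_j) = sum(alpha[j+1].values()) / sum(alpha[j].values()).
    """
    BOS = "<s>"                          # sentence-start padding (an assumption)
    w = [BOS, BOS] + list(words)         # w[j+1] is the j-th word, j = 1 .. n
    n = len(words)
    alpha = [{} for _ in range(n + 1)]
    alpha[1] = {"NODF": p_word(w[2], w[0], w[1]), "REP1": 0.0}
    for j in range(1, n):
        wj, wnext = w[j + 1], w[j + 2]
        ctx_nodf = (w[j], w[j + 1])      # cleaned context if no DF precedes w_j
        ctx_rep1 = (w[j - 1], w[j])      # cleaned context if w_j repeats w_{j-1}
        a = alpha[j]
        alpha[j + 1]["NODF"] = (a["NODF"] * p_word(wnext, *ctx_nodf)
                                + a["REP1"] * p_word(wnext, *ctx_rep1))
        delta = 1.0 if wnext == wj else 0.0   # REP1 requires w_{j+1} = w_j
        alpha[j + 1]["REP1"] = delta * (a["NODF"] * p_rep1(*ctx_nodf)
                                        + a["REP1"] * p_rep1(*ctx_rep1))
    return alpha
```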
2.4. Estimation

The backoff N-gram probabilities in the model are estimated from N-gram counts, including counts of the DF events. We used standard Good-Turing discounting in the backoff for both the baseline and DF trigram models. For the experiments reported here involving hidden DF events, we used a subset of the Switchboard corpus that was hand-annotated for disfluencies as well as for linguistic segments. In the absence of hand-annotated training data, an iterative reestimation (EM) algorithm could be used to estimate the N-gram probabilities for hidden DF events.

When counting N-grams for the DF model, the same context modifications used in the DF cleanup operations must be performed on the training data. For example, the word sequence

    SHE UH GOT REAL LUCKY

is counted as having the following trigrams:

    SHE UH
    SHE GOT
    SHE GOT REAL
    GOT REAL LUCKY

Note that the trigrams

    SHE UH GOT
    UH GOT REAL

which would be generated for a standard trigram LM, are not generated for the DF model. Because DF and word events are represented uniformly as N-grams in the model, the standard estimation procedure will normalize DF and non-DF event probabilities. This is a convenient simplification over alternative approaches in which DFs are modeled separately from the fluent word sequences.

3. RESULTS AND ANALYSIS

3.1. Overall results

We trained a trigram model for FP, REP, and DEL disfluencies as described above, using 1.4 million words of Switchboard data labeled for DF events (see note 2). The model was then evaluated on a test set of 17,500 words. Table 1 compares the baseline trigram and DF models (see note 3).

    Table 1. Overall results

    Model               Perplexity   Word error
    Baseline trigram    119.1        50.21%
    DF trigram          120.9        50.23%

As can be seen, there is no significant difference in recognition word error rates. While this may be due to a number of factors (some of which we discuss in Section 4), we would have expected at least a reduction in perplexity for the DF model; this was not the case. We wanted to know whether this was because our underlying assumptions were wrong, or whether it was due to other factors, so we decided to analyze the DF model's performance in detail.

We note with regard to these and later results that some types of disfluencies may contain word fragments (from speakers cutting themselves off in mid-word). According to [9], 20 to 25% of repetitions and deletions in Switchboard contain word fragments; however, filled pauses, as classified here, never involve word fragments. Fragments are usually not part of the vocabulary of current recognizers, and are not modeled in our system.

Note 2: A preliminary version of annotated Switchboard data was made available to the 1995 Johns Hopkins Language Modeling Workshop; the LDC will release a final version.

Note 3: Both baseline and DF models were trained on the same data, which corresponds to only a portion of the full training corpus. Therefore, the perplexity figures are higher here than in some of the comparisons below.
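The "local perplexity comparisons" mentioned in the abstract amount to computing perplexity only over selected word positions (for instance, words at or immediately following a DF event) rather than over the whole test set. A minimal sketch of that bookkeeping, assuming per-position log probabilities are already available from each model, is shown below; the function names and interfaces are our own.

```python
# Sketch of overall vs. local perplexity, assuming each model has produced a
# list of natural-log word probabilities, one per test-set position.

import math

def perplexity(logprobs):
    """Perplexity over all positions: exp of the negative mean log probability."""
    return math.exp(-sum(logprobs) / len(logprobs))

def local_perplexity(logprobs, positions):
    """Perplexity restricted to a subset of positions (a local comparison)."""
    selected = [logprobs[i] for i in positions]
    return math.exp(-sum(selected) / len(selected))
```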

Published in: Proc. ICASSP-96, May 7-10, 1996, Atlanta, GA. © 1996 IEEE.